Last week

We learned the big 5 data transformation functions in the dplyr package:

function purpose
filter() Pick observations by their values
arrange() Reorder the rows
select() Pick variables by their names
mutate() Create new variables
summarise() Create summaries

(http://r4ds.had.co.nz/transform.html)

Last week

And we used these functions to create this plot:

Today we’ll focus on tidy data

https://www.rstudio.com/resources/cheatsheets/

Objectives

By the end of this session, you should be able to:

  • Explain what makes data “tidy”, and why messy data suck
  • Tidy data using the tidyr package, a member of the tidyverse

Login to Duke’s Docker-ized version of RStudio Server

  • Login to your instance by going to https://vm-manage.oit.duke.edu/containers and entering your NetID.
  • Click on Docker
  • Click on RStudio
  • When RStudio loads, restart the R session (Ctrl/Cmd+Shift+F10), clear the console (Ctrl/Cmd+L), and clear your workspace

Open your project

Is your project still open? If not, click on the project icon to load it. (Don’t create a new one.)

Optional: Change your layout

Tools > Global Options

Tidy Data

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table (Wickham 2014)

Wickham also uses tidy as a verb meaning “to structure a dataset to facilitate analysis”.

Messy Data 1: Column Headers Are Values

How many variables are in this dataset? This table and the following from Wickham (2014).

Tidy Data 1: Melt

Columns are variables and rows are observations (i.e., combination of religion and income):

Messy Data 2: Multiple variables stored in one column

How would you link up with population data by country and age group to calculate rates? (answer: after tidying)

Tidy Data 2: Melt and Split

Messy Data 3: Variables are stored in both rows and columns

The element column is not a variable with values, but rather a vector of variable names.

Tidy Data 3: Melt and Cast

Every row becomes an observation (i.e., weather station by date) with two measurements:

Messy Data 4: Multiple types in one table

This style of data entry and storage invites errors and inconsistencies:

Tidy Data 4: Use different tables for different levels

Don’t Cry, Tidy!

The Tidy 4

function purpose
gather() Gather variable values spread across multiple columns
spread() Spread out observation values scattered across rows
separate() Split one column into two or more columns
unite() Collapse multiple columns into one column

(http://r4ds.had.co.nz/tidy-data.html)

Some Simple Examples

Load the tidyverse package and look at the included dataset called table4a.

  library(tidyverse)
  table4a
## # A tibble: 3 x 3
##       country `1999` `2000`
## *       <chr>  <int>  <int>
## 1 Afghanistan    745   2666
## 2      Brazil  37737  80488
## 3       China 212258 213766

What is messy about this tibble?

gather()

1999 and 2000 are values, not variables. The variable should be year.

Examples and figures from Wickham and Grolemund (2017)

gather()

  table4a %>%  # remember pipes?
    gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##       country  year  cases
##         <chr> <chr>  <int>
## 1 Afghanistan  1999    745
## 2      Brazil  1999  37737
## 3       China  1999 212258
## 4 Afghanistan  2000   2666
## 5      Brazil  2000  80488
## 6       China  2000 213766

We gather values by a set of column names that define the key, which we name year. We then store the values in a new variable called cases.

gather()

Try gathering table4b (also included with tidyverse) to create variables for year and population values (call it “population”).

  table4b
## # A tibble: 3 x 3
##       country     `1999`     `2000`
## *       <chr>      <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2      Brazil  172006362  174504898
## 3       China 1272915272 1280428583

spread()

Now look at table2. type is not a variable!

  table2
## # A tibble: 12 x 4
##        country  year       type      count
##          <chr> <int>      <chr>      <int>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583

spread()

Let’s use spread() to make two proper variables: cases and population.

spread()

The key in this example is the type column, and the values we want to spread are stored in count.

  table2
## # A tibble: 12 x 4
##        country  year       type      count
##          <chr> <int>      <chr>      <int>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583

spread()

Just like with gather(), we’ll pass two arguments to spread() in addition to an object: key and value.

  spread(table2, key = type, value = count)
## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

spread out wide, or gather (stack) into a long (tall) pile

split and combine columns

The separate() and unite() functions split and combine columns.

separate()

Here we want split rate into two columns: cases and population

separate()

It’s simple. Just tell R the column to split and the columns to create:

  table3 %>% 
    separate(col = rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <int>  <chr>      <chr>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

separate()

separate() is smart enough to know to split on “/”, but you can also specify with sep = "/". The convert==TRUE argument will guess at the data type rather than leave the new columns characters.

  table3 %>% 
    separate(col = rate, into = c("cases", "population"), 
             sep="/", convert = TRUE)
## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

separate()

Can also separate on a specific character position with the sep argument. For example, we separate year into century and year by specifying sep==2. Count from left (1, 2, …) or from right (-1, -2, …).

  table3 %>% 
    separate(col = rate, into = c("century", "year"), 
             sep=2, convert = TRUE)
## # A tibble: 6 x 4
##       country  year century            year
## *       <chr> <int>   <int>           <chr>
## 1 Afghanistan  1999      74      5/19987071
## 2 Afghanistan  2000      26     66/20595360
## 3      Brazil  1999      37   737/172006362
## 4      Brazil  2000      80   488/174504898
## 5       China  1999      21 2258/1272915272
## 6       China  2000      21 3766/1280428583

unite()

unite() does the opposite: it combines two or more columns into one.

unite()

In this example, new is the name of the new column we want to create by combining columns century and year.

  table5 %>% 
    unite(col = new, century, year)
## # A tibble: 6 x 3
##       country   new              rate
## *       <chr> <chr>             <chr>
## 1 Afghanistan 19_99      745/19987071
## 2 Afghanistan 20_00     2666/20595360
## 3      Brazil 19_99   37737/172006362
## 4      Brazil 20_00   80488/174504898
## 5       China 19_99 212258/1272915272
## 6       China 20_00 213766/1280428583

unite()

By default unite() will add an underscore, but adding sep="" (no space) will combine without the _.

  table5 %>% 
    unite(col = new, century, year, sep = "")
## # A tibble: 6 x 3
##       country   new              rate
## *       <chr> <chr>             <chr>
## 1 Afghanistan  1999      745/19987071
## 2 Afghanistan  2000     2666/20595360
## 3      Brazil  1999   37737/172006362
## 4      Brazil  2000   80488/174504898
## 5       China  1999 212258/1272915272
## 6       China  2000 213766/1280428583

Missing values

One last point to consider is how to handle missing values when transforming data. Let’s make a tibble called stocks with 2 years of quarterly data on returns.

  stocks <- tibble(
    year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
    qtr    = c(   1,    2,    3,    4,    2,    3,    4),
    return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

Missing values

There are two types of missing data:

  1. Explicit: Q4 of 2015 is present in the data, but there is no value.
  2. Implicit: Q1 of 2016 is not even present in the data

Make missing explicit

complete() returns all combinations of a set of columns and will fill in NA when combinations are missing.

  stocks %>% 
    complete(year, qtr)
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4     NA
## 5  2016     1     NA
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. doi:10.18637/jss.v059.i10.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.